In this exercise, I hope you learn the following:

  1. How to create an R project
  2. How to install/load the required libraries
  3. How to import/read data into RStudio
  4. How to check the data (summary statistics, variable names, etc.)
  5. How to use the dplyr package verbs to manipulate data
  6. How to summarise and aggregate the data
  7. How to visualise the data using ggplot2

To demonstrate and achieve the above objectives, we will be using 15-minute flow data from the Environment Agency (EA). The gauging station is located on the River Soar at Littlethorpe.

Before starting, if you don’t have R and RStudio installed on your machine, follow the links below to download them (install R before installing RStudio).

1. Create an R project

It is good practice to create an R project before starting any R-related work. This allows you to:

  • Keep a set of related data, analyses, and text self-contained in a single folder
  • Use relative paths to files rather than absolute paths
  • Easily move your project around on your computer and share it with others
To create an R project, first create an empty folder named, for example, my_Rproject (say on your Desktop), then create three sub-folders named data, scripts, and plots within the my_Rproject folder. Next, open RStudio and, from the main menu, go to File > New Project, choose the Existing Directory option, browse to the my_Rproject folder, and click Create Project.
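If you prefer to script this setup, the same folder structure can be created from the R console (a minimal sketch, assuming you want my_Rproject under the current working directory):

```r
# create the project folder and its three sub-folders
# (recursive = TRUE creates my_Rproject and data in one call;
#  showWarnings = FALSE makes the calls safe to re-run)
dir.create("my_Rproject/data", recursive = TRUE, showWarnings = FALSE)
dir.create("my_Rproject/scripts", showWarnings = FALSE)
dir.create("my_Rproject/plots", showWarnings = FALSE)
```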


2. Data

In general, R can read data from an online or a local source, but for the purpose of this demo, we will download the flow data from this link and place it in the data sub-folder.

3. Install/load libraries

Did you say how? Well, to install packages, there are two main options:

  1. From the Packages pane in RStudio: click Install and type the package name
  2. From the console/code editor: use the function install.packages(), for example install.packages("tidyverse")

Ok! Now install the following packages needed for the rest of this demo:

install.packages("tidyverse")
install.packages("lubridate")
install.packages("dygraphs")
install.packages("glm2")
install.packages("xts")

Next let’s load these installed packages:

# load the necessary libraries
library(tidyverse)
library(lubridate)
library(dygraphs)
library(glm2)
library(xts)

4. Read the data

The file starts with 20 rows that we don’t need (info related to the gauging station), so we will use the skip argument of the read_csv() function to skip the first 20 rows.

# First, list the files in the data folder (we know in this case that we have just one file, but this is useful when there are several)
files <- list.files("./data")
files
## [1] "Littlethorpe 4082 15 Minute Flow 1983 to 2011.csv"
# Read the file, using the skip argument (notice that we use the paste0() function to concatenate two strings)
flow <- read_csv(paste0("./data/", files[1]), skip = 20)

5. Variable names

The variable names are a bit messy and need some adjustment. Let’s change them using the colnames() function.

# Change variable names 
colnames(flow) <- c("time_stamp","flow","quality","interp","tags","comments")
# print flow
flow
## # A tibble: 952,870 x 6
##    time_stamp           flow quality interp tags  comments
##    <chr>               <dbl> <chr>    <int> <chr> <chr>   
##  1 01/04/1983 09:00:00  1.18 G          102 <NA>  <NA>    
##  2 01/04/1983 09:15:00  1.17 G          102 <NA>  <NA>    
##  3 01/04/1983 09:30:00  1.16 G          102 <NA>  <NA>    
##  4 01/04/1983 09:45:00  1.15 G          102 <NA>  <NA>    
##  5 01/04/1983 10:00:00  1.15 G          102 <NA>  <NA>    
##  6 01/04/1983 10:15:00  1.15 G          102 <NA>  <NA>    
##  7 01/04/1983 10:30:00  1.15 G          102 <NA>  <NA>    
##  8 01/04/1983 10:45:00  1.15 G          102 <NA>  <NA>    
##  9 01/04/1983 11:00:00  1.16 G          102 <NA>  <NA>    
## 10 01/04/1983 11:15:00  1.16 G          102 <NA>  <NA>    
## # ... with 952,860 more rows

6. Summary statistics

Let’s check what the data look like and get some descriptive statistics.

# get summary statistics of the data  
summary(flow)                
##   time_stamp             flow          quality              interp   
##  Length:952870      Min.   : 0.023   Length:952870      Min.   :102  
##  Class :character   1st Qu.: 0.394   Class :character   1st Qu.:102  
##  Mode  :character   Median : 0.648   Mode  :character   Median :102  
##                     Mean   : 1.263                      Mean   :102  
##                     3rd Qu.: 1.320                      3rd Qu.:102  
##                     Max.   :26.200                      Max.   :102  
##                     NA's   :8                                        
##      tags             comments        
##  Length:952870      Length:952870     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
# or 
glimpse(flow) 
## Observations: 952,870
## Variables: 6
## $ time_stamp <chr> "01/04/1983 09:00:00", "01/04/1983 09:15:00", "01/0...
## $ flow       <dbl> 1.18, 1.17, 1.16, 1.15, 1.15, 1.15, 1.15, 1.15, 1.1...
## $ quality    <chr> "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "...
## $ interp     <int> 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 1...
## $ tags       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ comments   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
# data dimensions
dim(flow) 
## [1] 952870      6
# number of columns
ncol(flow)
## [1] 6
# number of rows
nrow(flow)
## [1] 952870

7. Column contents

For example, let’s check the distinct values of the comments variable using the unique() function, which returns the unique elements of a vector.

# Check the comments variable                 
unique(flow$comments) 
## [1] NA                                                                              
## [2] "manual edit."                                                                  
## [3] "any spurious data may be the result of ice in the stilling well and/or channel"

8. Date and time

From the summary results you can see that the time_stamp variable is of class character, so it needs to be converted to a datetime class. But before doing so, let’s try some examples of how functions from the lubridate package parse datetimes.

## Examples of date time parsing:
my_brthdy <- "13/5/1984 13;20-00"
parsed_date <- dmy_hms(my_brthdy)
parsed_date
## [1] "1984-05-13 13:20:00 UTC"
# extract year
year(parsed_date)
## [1] 1984
# extract month
month(parsed_date)
## [1] 5
# extract day
day(parsed_date)
## [1] 13
# extract minutes
minute(parsed_date)
## [1] 20
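lubridate has a sibling parser for each common field order, so you simply pick the function that mirrors the format of your input (a couple of extra examples; the dates here are arbitrary):

```r
library(lubridate)

# year-month-day, no time component
ymd("1984-05-13")
# month-day-year with hours and minutes
mdy_hm("5/13/1984 13:20")
# the same timestamp parsed with an explicit time zone instead of the default UTC
dmy_hms("13/5/1984 13:20:00", tz = "Europe/London")
```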

9. Application for flow data

9.1 mutate verb:

Here we are going to overwrite the time_stamp column using mutate(), with dmy_hms() to parse the time stamp.

# first thing to do is to parse the date time
# to do that, we use dmy_hms() together with mutate() to overwrite the current datetime variable
flow1 <- flow %>% mutate(time_stamp = dmy_hms(time_stamp))
flow1
## # A tibble: 952,870 x 6
##    time_stamp           flow quality interp tags  comments
##    <dttm>              <dbl> <chr>    <int> <chr> <chr>   
##  1 1983-04-01 09:00:00  1.18 G          102 <NA>  <NA>    
##  2 1983-04-01 09:15:00  1.17 G          102 <NA>  <NA>    
##  3 1983-04-01 09:30:00  1.16 G          102 <NA>  <NA>    
##  4 1983-04-01 09:45:00  1.15 G          102 <NA>  <NA>    
##  5 1983-04-01 10:00:00  1.15 G          102 <NA>  <NA>    
##  6 1983-04-01 10:15:00  1.15 G          102 <NA>  <NA>    
##  7 1983-04-01 10:30:00  1.15 G          102 <NA>  <NA>    
##  8 1983-04-01 10:45:00  1.15 G          102 <NA>  <NA>    
##  9 1983-04-01 11:00:00  1.16 G          102 <NA>  <NA>    
## 10 1983-04-01 11:15:00  1.16 G          102 <NA>  <NA>    
## # ... with 952,860 more rows

You can see that the class of the time_stamp variable has changed to dttm.

9.2 The select() verb to choose the relevant variables

For example, let’s say we want to keep just the time_stamp, flow, quality, interp, and tags variables. You can refer to variables either by position or by name.

flow2 <- flow1 %>% select(time_stamp,flow,quality,interp,tags) 
# print flow2
flow2
## # A tibble: 952,870 x 5
##    time_stamp           flow quality interp tags 
##    <dttm>              <dbl> <chr>    <int> <chr>
##  1 1983-04-01 09:00:00  1.18 G          102 <NA> 
##  2 1983-04-01 09:15:00  1.17 G          102 <NA> 
##  3 1983-04-01 09:30:00  1.16 G          102 <NA> 
##  4 1983-04-01 09:45:00  1.15 G          102 <NA> 
##  5 1983-04-01 10:00:00  1.15 G          102 <NA> 
##  6 1983-04-01 10:15:00  1.15 G          102 <NA> 
##  7 1983-04-01 10:30:00  1.15 G          102 <NA> 
##  8 1983-04-01 10:45:00  1.15 G          102 <NA> 
##  9 1983-04-01 11:00:00  1.16 G          102 <NA> 
## 10 1983-04-01 11:15:00  1.16 G          102 <NA> 
## # ... with 952,860 more rows

Or we can simply drop the comments variable, which gives the same result.

flow1 %>% select(-comments) 
## # A tibble: 952,870 x 5
##    time_stamp           flow quality interp tags 
##    <dttm>              <dbl> <chr>    <int> <chr>
##  1 1983-04-01 09:00:00  1.18 G          102 <NA> 
##  2 1983-04-01 09:15:00  1.17 G          102 <NA> 
##  3 1983-04-01 09:30:00  1.16 G          102 <NA> 
##  4 1983-04-01 09:45:00  1.15 G          102 <NA> 
##  5 1983-04-01 10:00:00  1.15 G          102 <NA> 
##  6 1983-04-01 10:15:00  1.15 G          102 <NA> 
##  7 1983-04-01 10:30:00  1.15 G          102 <NA> 
##  8 1983-04-01 10:45:00  1.15 G          102 <NA> 
##  9 1983-04-01 11:00:00  1.16 G          102 <NA> 
## 10 1983-04-01 11:15:00  1.16 G          102 <NA> 
## # ... with 952,860 more rows

9.3 Extract year, month, day, and time from the time_stamp variable

# In order to subset the data, for example annually, we need to separate the year, month, day, and hour from time_stamp

flow3 <- flow2 %>% mutate(year = year(time_stamp),
                            month = month(time_stamp),
                            day = day(time_stamp),
                            hour = hour(time_stamp),
                            min = minute(time_stamp))
# Print flow3 
flow3
## # A tibble: 952,870 x 10
##    time_stamp           flow quality interp tags   year month   day  hour
##    <dttm>              <dbl> <chr>    <int> <chr> <dbl> <dbl> <int> <int>
##  1 1983-04-01 09:00:00  1.18 G          102 <NA>   1983  4.00     1     9
##  2 1983-04-01 09:15:00  1.17 G          102 <NA>   1983  4.00     1     9
##  3 1983-04-01 09:30:00  1.16 G          102 <NA>   1983  4.00     1     9
##  4 1983-04-01 09:45:00  1.15 G          102 <NA>   1983  4.00     1     9
##  5 1983-04-01 10:00:00  1.15 G          102 <NA>   1983  4.00     1    10
##  6 1983-04-01 10:15:00  1.15 G          102 <NA>   1983  4.00     1    10
##  7 1983-04-01 10:30:00  1.15 G          102 <NA>   1983  4.00     1    10
##  8 1983-04-01 10:45:00  1.15 G          102 <NA>   1983  4.00     1    10
##  9 1983-04-01 11:00:00  1.16 G          102 <NA>   1983  4.00     1    11
## 10 1983-04-01 11:15:00  1.16 G          102 <NA>   1983  4.00     1    11
## # ... with 952,860 more rows, and 1 more variable: min <int>

10. Missing data

## check for NAs 
# 1- check whether there is any missing data 
anyNA(flow3$flow)
## [1] TRUE
# 2- check how many missing values
sum(is.na(flow3$flow))
## [1] 8
# 3- locate the missing values using the quality variable
missed <- flow3 %>% filter(quality == "M")
missed
## # A tibble: 8 x 10
##   time_stamp           flow quality interp tags   year month   day  hour
##   <dttm>              <dbl> <chr>    <int> <chr> <dbl> <dbl> <int> <int>
## 1 1983-08-01 08:15:00    NA M          102 <NA>   1983  8.00     1     8
## 2 1984-04-01 08:45:00    NA M          102 <NA>   1984  4.00     1     8
## 3 1985-01-01 08:15:00    NA M          102 <NA>   1985  1.00     1     8
## 4 1985-04-01 08:45:00    NA M          102 <NA>   1985  4.00     1     8
## 5 1985-11-01 08:15:00    NA M          102 <NA>   1985 11.0      1     8
## 6 1985-12-01 08:45:00    NA M          102 <NA>   1985 12.0      1     8
## 7 1985-12-01 23:30:00    NA M          102 <NA>   1985 12.0      1    23
## 8 1986-07-01 23:45:00    NA M          102 <NA>   1986  7.00     1    23
## # ... with 1 more variable: min <int>
# or look for NAs
missed <- flow3 %>% filter(is.na(flow))
missed
## # A tibble: 8 x 10
##   time_stamp           flow quality interp tags   year month   day  hour
##   <dttm>              <dbl> <chr>    <int> <chr> <dbl> <dbl> <int> <int>
## 1 1983-08-01 08:15:00    NA M          102 <NA>   1983  8.00     1     8
## 2 1984-04-01 08:45:00    NA M          102 <NA>   1984  4.00     1     8
## 3 1985-01-01 08:15:00    NA M          102 <NA>   1985  1.00     1     8
## 4 1985-04-01 08:45:00    NA M          102 <NA>   1985  4.00     1     8
## 5 1985-11-01 08:15:00    NA M          102 <NA>   1985 11.0      1     8
## 6 1985-12-01 08:45:00    NA M          102 <NA>   1985 12.0      1     8
## 7 1985-12-01 23:30:00    NA M          102 <NA>   1985 12.0      1    23
## 8 1986-07-01 23:45:00    NA M          102 <NA>   1986  7.00     1    23
## # ... with 1 more variable: min <int>
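With only 8 isolated gaps, one common option (not needed for the rest of this demo) is to fill them by linear interpolation, e.g. with na.approx() from the zoo package (zoo is installed as a dependency of xts). A toy sketch:

```r
library(zoo)

# linear interpolation across an interior NA
x <- c(1.2, NA, 1.4, 1.5)
na.approx(x)  # the NA becomes 1.3, halfway between its neighbours
# applied to the flow series it would look like:
# flow3 <- flow3 %>% mutate(flow = na.approx(flow, na.rm = FALSE))
```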

Now let’s check the number of records each year to see how complete our data are.

# Calculate the number of records each year 
records <- flow3 %>% group_by(year) %>% tally()
records
## # A tibble: 29 x 2
##     year     n
##    <dbl> <int>
##  1  1983 11710
##  2  1984 26365
##  3  1985 20637
##  4  1986 17569
##  5  1987 35040
##  6  1988 35136
##  7  1989 35040
##  8  1990 35040
##  9  1991 35040
## 10  1992 35136
## # ... with 19 more rows

You can see, for example, that the first few years are incomplete (a complete 365-day year of 15-minute records has 35,040 rows), so you may choose to remove them.
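The record counts above can then be used to drop incomplete years (a sketch, not part of the workflow below; 35,040 = 4 records per hour × 24 hours × 365 days, so leap years have slightly more):

```r
# keep only years whose record count reaches a full 365-day year
# (records and flow3 are the objects built above)
complete_years <- records %>% filter(n >= 35040) %>% pull(year)
flow_complete  <- flow3 %>% filter(year %in% complete_years)
```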

11. Summarizing the data

To illustrate the use of group_by() and summarize(), we can, for example, aggregate the data to an hourly time step or compute the annual maximum flow.

# Aggregate the data to hourly
hflow <- flow3 %>% group_by(year,month,day,hour) %>% 
                        summarize(flow = mean(flow, na.rm = TRUE), 
                                  datetime = first(time_stamp))
# print hflow
hflow
## # A tibble: 238,219 x 6
## # Groups: year, month, day [?]
##     year month   day  hour  flow datetime           
##    <dbl> <dbl> <int> <int> <dbl> <dttm>             
##  1  1983  4.00     1     9  1.16 1983-04-01 09:00:00
##  2  1983  4.00     1    10  1.15 1983-04-01 10:00:00
##  3  1983  4.00     1    11  1.17 1983-04-01 11:00:00
##  4  1983  4.00     1    12  1.17 1983-04-01 12:00:00
##  5  1983  4.00     1    13  1.17 1983-04-01 13:00:00
##  6  1983  4.00     1    14  1.19 1983-04-01 14:00:00
##  7  1983  4.00     1    15  1.18 1983-04-01 15:00:00
##  8  1983  4.00     1    16  1.18 1983-04-01 16:00:00
##  9  1983  4.00     1    17  1.17 1983-04-01 17:00:00
## 10  1983  4.00     1    18  1.17 1983-04-01 18:00:00
## # ... with 238,209 more rows
# Calculate the annual max flow
am_flow <- flow3 %>% group_by(year) %>% 
  summarize(am_flow = max(flow, na.rm = TRUE))
am_flow
## # A tibble: 29 x 2
##     year am_flow
##    <dbl>   <dbl>
##  1  1983   13.0 
##  2  1984   15.7 
##  3  1985    7.34
##  4  1986    9.55
##  5  1987   16.3 
##  6  1988   23.5 
##  7  1989   20.6 
##  8  1990   17.5 
##  9  1991   16.3 
## 10  1992   16.6 
## # ... with 19 more rows
# or, for example, count the number of hours each year where the flow exceeds the 99.5th percentile 
# 1: find the Q99.5   
Q995 <- quantile(hflow$flow, prob = 0.995, na.rm = T)

# 2: values exceeding the Q995  
Q995_flow <- hflow %>% filter(flow >= Q995)
Q995_flow
## # A tibble: 1,198 x 6
## # Groups: year, month, day [124]
##     year month   day  hour  flow datetime           
##    <dbl> <dbl> <int> <int> <dbl> <dttm>             
##  1  1983  4.00    11    13  11.8 1983-04-11 13:00:00
##  2  1983  4.00    11    14  12.1 1983-04-11 14:00:00
##  3  1983  4.00    11    15  12.3 1983-04-11 15:00:00
##  4  1983  4.00    11    16  12.2 1983-04-11 16:00:00
##  5  1983  4.00    11    17  12.1 1983-04-11 17:00:00
##  6  1983  4.00    11    18  12.0 1983-04-11 18:00:00
##  7  1983  5.00     2     5  12.4 1983-05-02 05:00:00
##  8  1983  5.00     2     6  12.6 1983-05-02 06:00:00
##  9  1983  5.00     2     7  12.8 1983-05-02 07:00:00
## 10  1983  5.00     2     8  12.8 1983-05-02 08:00:00
## # ... with 1,188 more rows
# 3: find the number of exceedance hours each year 
exceedance <- Q995_flow %>% group_by(year) %>% 
  summarise(events = n())
exceedance
## # A tibble: 23 x 2
##     year events
##    <dbl>  <int>
##  1  1983     16
##  2  1984     16
##  3  1987     49
##  4  1988     35
##  5  1989     74
##  6  1990     25
##  7  1991     18
##  8  1992     50
##  9  1993    123
## 10  1994     42
## # ... with 13 more rows
# select just the years where no values exceed the threshold
non_exce <- hflow %>% ungroup(year, month, day) %>%
  select(year) %>% distinct(year) %>%  
  filter(!year %in% exceedance$year)
# add 0 if no event  
non_exce$events <- 0 
exceedance <- full_join(exceedance,non_exce, by = c("year", "events")) %>% arrange(year)
exceedance
## # A tibble: 29 x 2
##     year events
##    <dbl>  <dbl>
##  1  1983   16.0
##  2  1984   16.0
##  3  1985    0  
##  4  1986    0  
##  5  1987   49.0
##  6  1988   35.0
##  7  1989   74.0
##  8  1990   25.0
##  9  1991   18.0
## 10  1992   50.0
## # ... with 19 more rows

12. Data visualization

Many useful R packages exist for visualising data, but here we use one of the most popular, ggplot2.

Let’s for example select and plot flow measurements of the year 2000:

# 1- subset the data for the year 2000
one_year <- hflow %>% filter(year == 2000) 
# line plot 
ggplot(data = one_year, aes(x = datetime, y = flow))+
  geom_line(color="blue")

# histogram
ggplot(data = one_year, aes(x = flow))+
  geom_histogram(fill = "blue")

# box plot
ggplot(data = one_year, aes(x = datetime, y = flow, group = month))+
  geom_boxplot()

ggplot2 is a powerful graphics package and contains plenty of functions and arguments that allow you to adjust and customise graphics.
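For instance, the year-2000 line plot above can be given proper axis labels, a title, and a different theme (a sketch; the m³/s unit is an assumption here, as the EA typically reports flow in cubic metres per second but the file itself does not state it):

```r
# customise the year-2000 line plot built above
ggplot(data = one_year, aes(x = datetime, y = flow)) +
  geom_line(color = "blue") +
  labs(x = "Date",
       y = expression(Flow~(m^3/s)),  # unit assumed, not stated in the file
       title = "Hourly mean flow, year 2000") +
  theme_minimal()
```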

Now let’s plot multiple panels using ggplot2’s facet_wrap() function.

# Facet wrap (for several plots)
# select, for example, the years 2000 to 2008 
multi_yr <- hflow %>% filter(year %in% 2000:2008) 
# line graphs
ggplot(data = multi_yr, aes(x = datetime, y = flow))+
  theme_bw()+ # simple theme
  geom_line(color = "blue")+
  facet_wrap(~year, scales = 'free_x')+
  scale_x_datetime(date_labels = "%b")

# histograms
ggplot(data = multi_yr, aes(x = flow))+
  geom_histogram()+
  facet_wrap(~year, scales = 'free_x')

# Annual maximum flow + local polynomial regression fitting
ggplot(am_flow, aes(x = year, y = am_flow))+
  geom_line()+
  geom_smooth(method = "loess", se = T, span = 0.30,color = "blue")

Or let’s try an interactive plot using the dygraphs package, where you can zoom and hover over the graphic to see details.

# Interactive plots  
hflow_ts <- as.xts(hflow, hflow$datetime)
hflow_ts[, 5] %>% dygraph() %>% dyRangeSelector()  # column 5 is the flow variable

13. Model fitting

The models here are not necessarily the right models for these data; the aim is just to illustrate how you can combine model fits with other ggplot graphics.

# simple linear model 
mod1 <- lm(events ~ year, data = exceedance)
# Poisson regression 
mod2 <- glm2(events ~ year, data = exceedance, family = "poisson")
# mod2 summary
summary(mod2)
## 
## Call:
## glm2(formula = events ~ year, family = "poisson", data = exceedance)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -10.045   -7.830   -1.280    3.062   10.777  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -25.882309   6.931474  -3.734 0.000188 ***
## year          0.014820   0.003469   4.272 1.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1174.6  on 28  degrees of freedom
## Residual deviance: 1156.2  on 27  degrees of freedom
## AIC: 1284.9
## 
## Number of Fisher Scoring iterations: 5
pred_y <- fitted(mod2)
# plot everything in one graphic 
ggplot() + 
  theme_bw() +
  geom_point(data = exceedance, aes(x = year, y = events))+
  geom_line(data = exceedance, aes(x = year, y = events))+
  geom_smooth(data = exceedance, aes(x = year, y = events), 
              method = "lm",color = "red")+
  geom_smooth(data = exceedance, aes(x = year, y = events), 
              method = "loess",span = 0.30,se=T,color = "blue")+
  geom_line(data = exceedance, aes(x = year, y = pred_y), 
            colour ="blue",size = 1)

14. Further reading

This section is under preparation.